Results of preliminary measurement have reacted
upon the formulation of methods by indicating need for 
shift of emphasis or even different approach. For one 
thing, they have tended to magnify the importance of 
more accurate aural identification of sounds. For an- 
other they have indicated a need for study of ways in 
which the ear assigns positions or spacing to formants 
and how these positions are affected by formant shape, 
etc. In this work it has been advantageous to synthesize 
sounds so that the parameters may be independently 
controlled. 

Both audible form or relative positions of the for- 
mants, and position of the form in the pitch dimension 
are important in identifying the vowels. But it is be- 
lieved to be the form and position at the cortical level 
rather than in the incident sound that is important. 
This is because phonetically significant differences and 
similarities are recognized at this level. The differences 
and similarities are not specific. They must be expressed 
as distributions. For example, when a given sound is 
identified acoustically, we can only say that there is a 
probability of X that it will be identified by ear as one
particular sound, a probability of Y that it will be
identified as another, and so on. 

As the program proceeds and the results for the 
additional speakers and for the listeners are included, a 
better understanding of some of the factors that have 
been discussed may be realized. Although a good deal of
work remains to be done, the prospects for acoustical
specification of the vowel sounds are encouraging.
This paper describes two ways of reducing the normally very complicated speech wave to simple geometrical
form without destroying intelligibility, reports the results of articulation tests of the simplified
speech, and gives two illustrations of things that can be done more simply with the simplified than with the 
normal wave. Both methods of simplification involve dichotomization of the amplitude scale and then 
quantization of the time scale. The first step is achieved by subjecting the speech wave to infinite peak 
clipping, the second by generating a rectangular wave whose switchings from one amplitude level to the other 
are related to the switchings of the clipped speech wave by one or the other of two rules. One of the rules 
yields marginal intelligibility with quanta 0.2 millisecond in length; the other requires slightly shorter 
quanta. The relative merits of the two methods are accounted for by an elementary application of information
theory. Autocorrelation functions, which are especially easy to compute for the simplified waves, are shown.


INTRODUCTION 


SPEECH may be reduced to the form shown schematically
in the top line of Fig. 1 by infinite peak
clipping. The sound pressure wave is converted into an 
electrical voltage, and the latter is repeatedly amplified 
and clipped until all that remains is an irregular rec- 
tangular wave that switches whenever the original speech 
wave crosses the zero-axis. The rectangular wave, or the 
approximately rectangular wave that results when it is
transduced back into acoustical form, is at least
marginally intelligible. Higher intelligibility (over 90
percent monosyllabic word articulation) may be obtained
without losing the simple, rectangular wave form
if the speech wave is first differentiated with respect to
time and then subjected to infinite clipping. These facts,
reported previously,¹ indicate that a considerable
amount of information is carried by the temporal pattern
of the crossings of the zero-axis in the normal or in
the differentiated wave.

* The work reported here was a joint project of the U. S. Navy
Electronics Laboratory, the Harvard Psycho-Acoustic Laboratory
(under contract between Harvard University and the ONR, U. S.
Navy, Project NR142-201, Report PNR-98), and the M.I.T.
Acoustics Laboratory (under contract between M.I.T. and the Air
Force Research Laboratories, Cambridge).


1 N. B. Gross and J. C. R. Licklider, The effects of tilting and
clipping upon the intelligibility of speech, Psycho-Acoustic Laboratory
Report PNR-11 (1946). J. C. R. Licklider and Irwin Pollack,
J. Acous. Soc. Am. 20, 42-51 (1948).


TIME-QUANTIZED SPEECH 


Since infinite clipping dichotomizes the amplitude 
dimension and no further reduction can be achieved in 
amplitude, it is necessary to operate upon the temporal 
pattern if the speech wave is to be further simplified. 
The simplifying operation in the observations to be 
described is quantization of the time scale. In quantized 
time, a rectangular wave can switch only at prede- 
termined instants. At each such instant, it either 
switches or does not switch. This makes the wave easy 
to describe. For example, one can specify its pattern 
with a sequence of binary digits, 0 representing one 
amplitude level and 1 the other. 
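The clipping and the binary description can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the wave is taken to be already sampled, a first difference stands in for differentiation with respect to time, and the two-sinusoid test signal and the names `infinite_clip`, `differentiate`, and `to_bits` are hypothetical, not part of the apparatus described here.

```python
import math


def infinite_clip(wave):
    """Keep only the sign of each sample: 1 if positive, otherwise 0."""
    return [1 if x > 0 else 0 for x in wave]


def differentiate(wave):
    """First difference, a discrete stand-in for differentiation in time."""
    return [b - a for a, b in zip(wave, wave[1:])]


def to_bits(levels):
    """Describe a two-level wave as a string of binary digits."""
    return "".join(str(v) for v in levels)


# Hypothetical test signal: two sinusoids sampled 8000 times per second.
rate = 8000
wave = [
    math.sin(2 * math.pi * 300 * n / rate)
    + 0.5 * math.sin(2 * math.pi * 1100 * n / rate)
    for n in range(40)
]

print(to_bits(infinite_clip(wave)))                 # clipped wave
print(to_bits(infinite_clip(differentiate(wave))))  # differentiated, then clipped
```

Only the instants at which the digit changes survive the clipping, which is the sense in which the information is carried by the zero crossings.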

The predetermined instants are set by a train of 
pulses that divides the time scale into intervals. Three 
pulse trains, yielding quanta of three different sizes, are 
shown in Fig. 1. Two methods of time quantization 
correspond to two rules for deciding whether, at the end
of an interval, the quantized output wave does or does
not switch. According to rule A, the output wave
switches at the end of an interval if, during the interval, 
the input wave has switched one or more times. Ac- 
cording to rule B, the output wave switches at the end 
of an interval if, during the interval, the input wave has 
switched an odd number of times. 

With the time-quantizing train of pulses TQ-1 in
Fig. 1, the quanta are so short that the input wave never 
switches more than once per interval. Methods A and B 
therefore yield the same result. When the intervals are 
longer, however, the input wave sometimes switches an 
even number of times during an interval. At the end of 
such an interval, the output wave derived by method A 
switches and the one derived by method B does not. 
Under TQ-2, for example, A and B get out of step. 
Finally, when the quanta are very long (e.g., TQ-3), the 
probability is almost unity that the input wave will 
switch at least once during each interval. The output 
wave A then switches at the end of almost every interval 
and becomes a regular square wave as shown in the 
illustration. However, the probability that the number 
of switches of the input wave is even is about the same 
as the probability that it is odd. Output wave B there- 
fore switches at the ends of about half the intervals. 
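The two rules can be written out as a short Python sketch. It assumes the clipped input is available as a finely sampled list of 0's and 1's and that the quantum is a whole number of samples; the function names and the test input are hypothetical and serve only to illustrate the rules as stated above.

```python
def count_switches(levels):
    """Number of level changes between successive samples."""
    return sum(1 for a, b in zip(levels, levels[1:]) if a != b)


def time_quantize(levels, quantum, rule):
    """Time-quantize a finely sampled two-level (0/1) wave.

    levels  : clipped input wave, sampled much more finely than the quanta
    quantum : interval length in samples (an assumption of this sketch)
    rule    : "A" -> switch if the input switched one or more times
              "B" -> switch if the input switched an odd number of times
    Returns the output level just after the end of each interval.
    """
    if rule not in ("A", "B"):
        raise ValueError("rule must be 'A' or 'B'")
    state = levels[0]
    output = []
    for start in range(0, len(levels) - quantum, quantum):
        n = count_switches(levels[start:start + quantum + 1])
        switch = n >= 1 if rule == "A" else n % 2 == 1
        if switch:
            state = 1 - state
        output.append(state)
    return output


# Hypothetical input that switches at every sample, so that each long
# quantum contains several switches.
dense = [n % 2 for n in range(41)]
print(time_quantize(dense, 4, "A"))  # switches at the end of every interval
print(time_quantize(dense, 4, "B"))  # four switches per interval, so it never switches
```

The contrived input exaggerates the long-quantum case: rule A degenerates into a regular square wave, while rule B follows the parity of the switch count. In this sketch, because the output starts at the input level, rule B amounts to reading the input level at each quantizing pulse.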


ARTICULATION TESTS 


In order to determine the effects upon its intelligibility 
of subjecting speech to the treatments just described, 
articulation tests were conducted. They were made with 
the Psycho-Acoustic Laboratory PB Lists² of monosyllabic
words, recorded on magnetic tape, and played
back through the clipping and quantizing apparatus. 
The setup is shown diagrammatically in Fig. 2. Except 
in preliminary tests, which showed that the effect of 
quantization was essentially the same for two quite 
different voices, the words were recorded by one male 
talker. The listeners were five male college students. 


2 The PB Lists were published by J. P. Egan in Laryngoscope
58, 955-991 (1948).


Fig. 1. Time quantization. The amplitude-dichotomized input
wave is the speech wave after it has been subjected to infinite peak 
clipping. From it, time-quantized output waves are derived by two 
methods, A and B, described in the text. The output waves are 
permitted to switch only at the instants marked by the trains of 
quantizing pulses. The pulse trains TQ-1, TQ-2, and TQ-3 produce 
temporal quanta of three different lengths. 


The RC circuits shown at 1 and 4 in Fig. 2 represent
a compromise between the uniform response and the
6-db-per-octave tilt used in the previous experiments on
infinite peak clipping.¹ The time constant T = RC was
0.0001 second, the value found to be most nearly optimal for
use in conjunction with the time-quantizing circuits.

The results of the articulation tests are shown in 
Fig. 3. Average articulation scores are plotted against 
the number of thousands of time quanta per second. 
With fewer than 2000 quanta per second, the listeners 
understood essentially nothing. The quantized speech 
sounded like an impure tone in the case of method A and 
like static in the case of method B. With quantizing 
rates between 2000 and 10,000 per second, method B 
yielded higher intelligibility scores than did method A. 
With either method, vowel sounds were the first to be- 
come intelligible, presumably because for them the 
density of zero crossings in the input to the time 
quantizer is lower than it is for consonants. With 10,000 
or more quanta per second, intelligibility was approxi- 
mately as high as it was with the RC filtering and 
infinite clipping alone (95 percent). This last result was 
obtained only with method B; time quantizer A did not 
operate reliably above 8000 quanta per second. How- 
ever, extrapolation of curve A indicates that it would 
not be far short of curve B at 10,000, and the considera- 
tion discussed in connection with TQ-1 of Fig. 1 suggests 
that above 10,000 or 15,000 the two methods must yield 
essentially the same result. 

Two qualifications of the results shown in Fig. 3 
should be emphasized immediately. First, the amplitude- 
and time-quantized speech sounded worse than the 
articulation scores suggest. This is due in part to the fact 
that infinite peak clipping makes the noise between 
words as strong as the speech itself and in part to the 
fact that the quantization rate usually bore no simple 
relation to the harmonic frequencies of the voiced speech 
sounds. Second, considerable training was required before
the members of the listening crew achieved the level
of proficiency in understanding distorted speech that is
reflected in Fig. 3. Before the tests of Fig. 3 were begun,
they had taken 223 tests in which they heard various
kinds of distorted speech. During that time, their scores
for method A, on which they were tested at the beginning
and at the end of the training period, increased
by a factor of 1.5.

Fig. 2. Block diagram of test set-up. Recorded word lists were
played back through (1) a high pass RC filter, (2) an infinite peak
clipper, (3) one of the time quantizers, and (4) a low pass RC filter
to the listeners’ headsets. The low pass filter affected the quality
but not the intelligibility of the speech.

Despite these qualifications, amplitude- and time- 
quantized speech is a medium in which communication 
is possible. With method B and 10,000 quanta per 
second, the trained listeners recorded 97.5 percent of the 
Psycho-Acoustic Laboratory spondaic dissyllable lists³
correctly on first hearing. Untrained listeners executed 
simple commands transmitted with 6000 quanta per 
second by method A and 5000 by method B. In short, 
amplitude- and time-quantized speech is sufficiently 
intelligible for its simple geometrical form to make it 
interesting. Two examples will illustrate advantages 
offered by the simplicity of form. 


INTELLIGIBILITY AND INFORMATION CONTENT 


The first example is an elementary application of 
information theory. It provides a first-order explanation 
of the superiority of method B over method A. 

The Wiener-Shannon⁴ formulation sets the amount of
information per quantum at

$$ H/\text{quantum} = -p \log_2 p - q \log_2 q, \tag{1} $$

where p is the probability of switching and q = 1 - p is
the probability of not switching. H/quantum is maximal
when p = q = 0.5; it is very small if either p or q has a
value near 1.0. Using subscripts to distinguish the two
methods, we note that p_A approaches 1.0 as the quanta
grow long, whereas p_B remains in the neighborhood of
0.5. The observed superiority of method B is therefore 
paralleled by a theoretical superiority: a wave of type B 
is capable of transmitting more information than is a 
wave of type A. 
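A few values of expression (1) make the comparison concrete. The Python sketch below evaluates the formula for illustrative switching probabilities; the probabilities are hypothetical round numbers chosen to show the trend, not measured values, and the function name is an assumption of the sketch.

```python
import math


def info_per_quantum(p):
    """Expression (1): H = -p log2 p - q log2 q, with q = 1 - p.

    The limiting value 0 is returned at p = 0 and p = 1, where one of the
    logarithms is undefined but the product tends to zero.
    """
    if p <= 0.0 or p >= 1.0:
        return 0.0
    q = 1.0 - p
    return -p * math.log2(p) - q * math.log2(q)


# Hypothetical probabilities: 0.5 is typical of a type-B wave with long
# quanta; values near 1.0 are typical of a type-A wave with long quanta.
for p in (0.5, 0.9, 0.99):
    print(p, round(info_per_quantum(p), 3))
```

The information per quantum falls from 1 binary digit at p = 0.5 to well under a tenth of a digit at p = 0.99, which is the direction of the difference claimed for the two methods.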

3 The spondee lists were published by C. V. Hudgins et al., in
Laryngoscope 57, 57-89 (1947).

4 N. Wiener, Cybernetics (Technology Press and John Wiley and
Sons, Inc., New York, 1948), p. 75. C. E. Shannon, Bell System
Tech. J. 27, 379-443, 623-656 (1948).

Instead of inferring values of p_A and p_B from a consideration
of how methods A and B must behave as the
quanta are made very long, we may estimate p_A and p_B
from relative frequencies of switching and not switching 
in photographed samples of quantized speech. This has 
been done, as a function of the length of the quantal 
interval, for the sentence, “Joe took father’s shoebench 
out.” If the counts are based only on the consonants, it 
turns out that the ratio of the articulation scores for 
methods A and B is approximately equal to the ratio of 
the two values of H/quantum. Thus, in the present case 
at least, application of the information formula based 
only on first-order probabilities accounts rather well for 
the observed differences in intelligibility between two 
signals. 


AUTOCORRELATION FUNCTIONS 


The second example is the computation of autocorrelation
functions. Autocorrelation has recently become
a practical as well as topical tool in the field of
communication, but the determination of the autocorrelation
function of a signal such as a sample of
ordinary speech requires considerable instrumentation.⁵
With amplitude-dichotomized, time-quantized speech,
the determination is reduced essentially to counting.
The autocorrelation function of f(t) is defined⁶ as

$$ \phi(\tau) = \frac{1}{T} \int_{-T/2}^{T/2} f(t)\, f(t+\tau)\, dt, \tag{2} $$

where τ is the variable interval by which f(t) is advanced
to produce f(t+τ), and -T/2 < t < T/2 is the interval
of time in which f(t) is defined. If we designate the
two amplitude levels of f(t) as 1 and 0, expression (2)
degenerates into

$$ \phi(\tau) = N_\tau(1,1)/N, \tag{3} $$

in which N_τ(1,1) is the number of instances in which
both f(t) and f(t+τ) have the value 1, and N is the total
number of instances examined.
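The counting in expression (3) can be sketched in Python as follows. The sketch assumes the quantized wave is held as a list of 0's and 1's, one per quantum, and that the lag is a whole number of quanta; N is taken as the number of pairs that fit within the sample. The function name and the test pattern are hypothetical.

```python
def autocorrelation_count(bits, lag):
    """Expression (3): the number of pairs with f(t) = f(t + lag) = 1,
    divided by the number of pairs examined."""
    if lag < 0 or lag >= len(bits):
        raise ValueError("lag must lie between 0 and len(bits) - 1")
    pairs = list(zip(bits, bits[lag:]))
    n11 = sum(1 for a, b in pairs if a == 1 and b == 1)
    return n11 / len(pairs)


# Hypothetical quantized wave: a square wave with a period of four quanta.
bits = [1, 1, 0, 0, 1, 1, 0, 0]
for lag in range(5):
    print(lag, autocorrelation_count(bits, lag))
```

For this pattern the count is largest at lags of zero and four quanta and falls to zero at a lag of two quanta, as expected for a wave whose period is four quanta.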

Fig. 3. Results of articulation tests. For the two methods of
quantizing, average word articulation scores are plotted against
the number of thousands of quanta per second. Each datum point
is based on six 50-word tests, each with five listeners.


5 Y. W. Lee and J. B. Wiesner, Electronics 23, 86-92 (1950).

6 φ(τ) is rigorously defined as the limit, as T approaches infinity,
of the time average specified in expression (2). In practical work,
however, it is of course necessary to average over a finite interval.


φ(τ) is the unnormalized autocorrelation function. It
might more properly be called the autocovariance function.
If we normalize it to determine the more familiar
product-moment correlation,
we find that the simplification provided by quantization
is no less great: 


$$ \rho(\tau) = \frac{N_\tau(1,1) - N^2(1)/N}{N(1) - N^2(1)/N}. \tag{5} $$


N(1) is simply the number of 1’s in N quanta. The right-hand
term of the numerator and the whole denominator
are therefore independent of τ and require only one
counting.
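The normalized count can be sketched in the same way. The Python below follows expression (5) under two assumptions that are not stated in the text: the record is longer than N quanta by at least the lag, so that every lag uses the same N starting instants, and N(1) is counted over those first N quanta. The function name and the test pattern are hypothetical.

```python
def normalized_autocorrelation(bits, lag, n):
    """Expression (5), computed by counting over the first n quanta.

    bits : quantized wave as 0's and 1's, at least n + lag quanta long
    lag  : shift in whole quanta
    n    : number of starting instants examined (N in the text)
    """
    if lag < 0 or n <= 0 or n + lag > len(bits):
        raise ValueError("need 0 <= lag and 0 < n <= len(bits) - lag")
    n1 = sum(bits[:n])                # N(1): counted once for all lags
    mean_term = n1 * n1 / n           # N^2(1)/N: also independent of the lag
    denominator = n1 - mean_term
    if denominator == 0:
        raise ValueError("wave is constant over the first n quanta")
    n11 = sum(1 for t in range(n) if bits[t] == 1 and bits[t + lag] == 1)
    return (n11 - mean_term) / denominator


# Hypothetical quantized wave: a square wave with a period of four quanta.
bits = [1, 1, 0, 0] * 4
for lag in range(5):
    print(lag, normalized_autocorrelation(bits, lag, 8))
```

Only the count of coincident 1's changes from lag to lag; the other two quantities are computed once, which is the economy the text describes.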

Besides simplicity, the procedures described by expressions
(3) and (5) have the advantages of digital
accuracy and complete determination. Counting admits
of no error of measurement. No approximation is introduced
by sampling τ only at quantal intervals because
φ(τ) of necessity follows a linear path between sample
points.

Normalized autocorrelation functions for five consonants
and one vowel are shown in Fig. 4. The counts
were made with the aid of a coincidence counter described
by E. B. Newman.⁷ The sounds are from “Joe
took father’s shoebench out,” quantized at a rate of
30,000 per second. The fact that the functions oscillate
more rapidly for the consonants than for the vowel 
indicates that the dominant components of the former 
are higher in frequency than the formants of the latter. 

It is of interest to compare the functions shown in
Fig. 4 with short-time autocorrelation functions of
normal speech described by K. N. Stevens.⁸ The
principal differences are that the functions for quantized
speech oscillate in a simpler and more regular manner
and that they decline more rapidly with increasing τ.


7 E. B. Newman (to be published).
8 K. N. Stevens, J. Acous. Soc. Am. 22, 769 (1950).

Fig. 4. Autocorrelation functions. The correlation between the
amplitude of the quantized wave at time t and the amplitude at
time t+τ is shown as a function of τ for sounds from the sentence,
“Joe took father’s shoebench out.” In the popes plots, the
heavy line represents the first utterance, the light line a second by
the same talker. In the plot for ch, the dashed curve represents the
first half (48 milliseconds) of the sound, the dotted curve the
second half. The fact that the oscillations of the autocorrelation
function are stronger and more regular for the halves taken
separately than for the over-all sound (solid line) suggests that the
time average must be taken over a rather short interval if detail is
not to be lost.